Confidence Decision Trees via Online and Active Learning for Streaming (BIG) Data

نویسنده

  • Rocco De Rosa
چکیده

Decision tree classifiers are a widely used tool in data stream mining. The use of confidence intervals to estimate the gain associated with each split leads to very effective methods, like the popular Hoeffding tree algorithm. From a statistical viewpoint, the analysis of decision tree classifiers in a streaming setting requires knowing when enough new information has been collected to justify splitting a leaf. Although some of the issues in the statistical analysis of Hoeffding trees have been already clarified, a general and rigorous study of confidence intervals for splitting criteria is missing. We fill this gap by deriving accurate confidence intervals to estimate the splitting gain in decision tree learning with respect to three criteria: entropy, Gini index, and a third index proposed by Kearns and Mansour. Our confidence intervals depend in a more detailed way on the tree parameters. We also extend our confidence analysis to a selective sampling setting, in which the decision tree learner adaptively decides which labels to query in the stream. We furnish theoretical guarantee bounding the probability that the classification is nonoptimal learning the decision tree via our selective sampling strategy. Experiments on real and synthetic data in a streaming setting show that our trees are indeed more accurate than trees with the same number of leaves generated by other techniques and our active learning module permits to save labeling cost. In addition, comparing our labeling strategy with recent methods, we show that our approach is more robust and consistent respect all the other techniques applied to incremental decision trees.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Confidence Decision Trees via Online and Active Learning for Streaming Data

Decision tree classifiers are a widely used tool in data stream mining. The use of confidence intervals to estimate the gain associated with each split leads to very effective methods, like the popular Hoeffding tree algorithm. From a statistical viewpoint, the analysis of decision tree classifiers in a streaming setting requires knowing when enough new information has been collected to justify...

متن کامل

Design and Test of the Real-time Text mining dashboard for Twitter

One of today's major research trends in the field of information systems is the discovery of implicit knowledge hidden in dataset that is currently being produced at high speed, large volumes and with a wide variety of formats. Data with such features is called big data. Extracting, processing, and visualizing the huge amount of data, today has become one of the concerns of data science scholar...

متن کامل

Online tree-based ensembles and option trees for regression on evolving data streams

The emergence of ubiquitous sources of streaming data has given rise to the popularity of algorithms for online machine learning. In that context, Hoeffding trees represent the state-of-the-art algorithms for online classification. Their popularity stems in large part from their ability to process large quantities of data with a speed that goes beyond the processing power of any other streaming...

متن کامل

Online Streaming Feature Selection Using Geometric Series of the Adjacency Matrix of Features

Feature Selection (FS) is an important pre-processing step in machine learning and data mining. All the traditional feature selection methods assume that the entire feature space is available from the beginning. However, online streaming features (OSF) are an integral part of many real-world applications. In OSF, the number of training examples is fixed while the number of features grows with t...

متن کامل

Distributed Decision Tree Learning for Mining Big Data Streams

Web companies need to effectively analyse big data in order to enhance the experiences of their users. They need to have systems that are capable of handling big data in term of three dimensions: volume as data keeps growing, variety as the type of data is diverse, and velocity as the is continuously arriving very fast into the systems. However, most of the existing systems have addressed at mo...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:
  • CoRR

دوره abs/1604.03278  شماره 

صفحات  -

تاریخ انتشار 2016